Lowest common denominator format. I don't have to assume any technical knowledge on your part to know you can use a CSV; anybody who has ever touched a default spreadsheet tool can open one. I prefer HDF5 myself for larger dataframes, but I'm used to interfacing with all kinds of people, especially open data folks, and I've learned something pretty quickly: don't assume people know anything about technology. Always lean toward the format that is accessible. I expect I'll have to move on from HDF5 if I want to learn how to deal with more distributed-computing-type problems. Thanks bruv, let me know if you still want that CSV.
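For what it's worth, the trade-off reads like this in pandas terms. A minimal sketch with a made-up toy frame; the binary-format calls are commented out because they pull in extra libraries (PyTables, pyarrow):

```python
import io
import pandas as pd

df = pd.DataFrame({"immigr": [0, 1], "rate": [0.28, 0.52]})

# CSV: universally readable, but everything round-trips as text,
# so dtypes and precision get re-inferred on the way back in.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)
print(back.equals(df))

# Binary formats keep dtypes and load faster, but need extra libraries:
#   df.to_hdf("data.h5", key="df")     # requires PyTables
#   df.to_feather("data.feather")      # requires pyarrow
```

For this tiny frame the CSV round trip happens to come back identical; with messier real data (dates, categoricals, NA codes) that stops being true, which is the whole argument for a binary format once everyone involved can read it.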
Once again, yes I would like to see that data. I'd also like to see that link to your lawyer's website you talked about in the other thread. For someone teaching the basics of "data" science, you seem to have an aversion to data itself.
Email me through the board; let's set up a 10-minute phone call if you're so interested in knowing about me and data science. My aversion to redoing work other people have already done is kinda rule #1 of programming. But here: http://www.kauffman.org/microsites/...-index-of-entrepreneurial-activity-data-files

KISA Data year X to 2014 is what you want. I'm sure you know enough to get that into different file formats if needed. You will need the Codebook (in the link I attached) to understand the column labels. You are probably most interested in immigr (a simple binary categorical variable) and natvty for more granular data throughout the different years, though natvty is a numbered column that corresponds to country codes you may want to store in a dictionary.

You will probably want to stitch together the CSVs, convert them to a different file format for performance reasons (hey, Feather could work), and then use Pandas to slice and dice through the years to get the growth figures for native vs. non-native entrepreneurship. I suggest reading them into Pandas and wrangling them there before optimizing for performance; it should be fairly simple, as all of the datasets have the same number of features, and you can easily categorize which timeframe a value comes from by appending a dummy combined year-month column to each CSV, or by referring to the year columns if you don't want to go into month-by-month detail. You will want to graph the change in ent015u against demographic factors to do your own in-depth analysis of the contested figure. You will not, I repeat, NOT want to use any spreadsheet tool, as each individual CSV is quite large. You may also want to check the CPS for cross-reference, and to see what person_ids correspond to; it may contain qualitative detail that the categories here don't capture.

My immigration attorney: http://gordonlawgrouppc.com/team/gali-schaham-gordon/
Don't contact her unsolicited unless you're willing to sink in 1-1.5k. Thanks.
Whatever you use, be aware of performance and memory issues. They're not that large but stitching everything together won't be totally trivial unless you're doing it programmatically.
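The stitching-plus-year-month-column step can be sketched in pandas. A minimal sketch assuming every yearly CSV has the same columns; the tiny in-memory frames and the country-code subset below are stand-ins for the real files and the Codebook's Appendix 4 table:

```python
import pandas as pd

# Hypothetical stand-ins for the yearly Kauffman CSVs; in practice you would
# use pd.read_csv("k.2013.csv") etc. -- same columns in every file.
frames = {
    2013: pd.DataFrame({"month": [1, 7], "immigr": [1, 0], "ent015u": [1, 0]}),
    2014: pd.DataFrame({"month": [2, 9], "immigr": [0, 1], "ent015u": [0, 1]}),
}

# Stitch the per-year frames together, tagging each row with its source year,
# then assemble the combined year-month column from the year/month parts.
combined = pd.concat(
    [df.assign(year=year) for year, df in frames.items()],
    ignore_index=True,
)
combined["yearmonth"] = pd.to_datetime(combined[["year", "month"]].assign(day=1))

# natvty-style numeric codes map to labels via a plain dict lookup
# (made-up subset of codes for illustration):
country_codes = {57: "United States", 303: "Mexico"}
# combined["natvty_label"] = combined["natvty"].map(country_codes)

# Share of ent015u == 1 by immigrant status:
rates = combined.groupby("immigr")["ent015u"].mean()
print(rates)
```

From there, swapping `read_csv` for `read_feather` (or writing the stitched frame out once with `to_feather`) is the performance upgrade mentioned above.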
Cohete! Apparently I have to repeat myself. Straight from the horse's mouth, the Codebook itself:

immigr: immigrant
natvty: country of birth (see Appendix 4 for codes)

You even have, for s**ts and giggles:

spneth: Spanish ethnicity

Here is our original discussion: "I've provided you data to show that immigrants found businesses at twice the rate of natives. This is a discussion independent of immigration status. You like to facet your data apparently; well, don't let changing the goalposts do anything for you."

Re: wages
http://www.cato.org/blog/immigrations-real-impact-wages-employment
David Card (2012), COMMENT: THE ELUSIVE SEARCH FOR NEGATIVE WAGE IMPACTS OF IMMIGRATION
http://davidcard.berkeley.edu/papers/jeea2012.pdf
Look to my edited post. You've moved the goalposts, and you know it. I've provided you data to show that immigrants found businesses at twice the rate of natives. This is a discussion independent of immigration status. You like to facet your data apparently; well, don't let changing the goalposts do anything for you. It's okay just to say "I'm lazy, and changing the goalposts is what I'm currently doing." I'd like to see your "incomplete code". Please share it in a GitHub repo, or format it here if you have privacy issues.
As for changing the goalposts, you don't even like thinking about your data too deeply, as it turns out. I guess the faceting perspective only comes out of things spoonfed to you.

http://fivethirtyeight.com/datalab/how-do-we-know-how-many-undocumented-immigrants-there-are/
http://www.pewhispanic.org/2008/10/02/appendix-a-methodology-2/

For 1820-2012, you have data on legal permanent residents from the DHS. You can extrapolate a reasonable confidence interval for illegal immigrants if that's your thing.
https://www.dhs.gov/publication/yearbook-immigration-statistics-2012-legal-permanent-residents

Have fun!
"Among Hispanics, immigrants were about twice as likely as those born in the U.S. to be self-employed, 11% to 6%. Almost one-in-five (17%) white immigrants were self-employed in 2014, compared with 11% of whites who were U.S. born."
http://www.pewsocialtrends.org/2015/10/22/immigrants-contributions-to-job-creation/
Have fun.
I haven't moved any goalposts. Here is the primer on the code and such:

Getting started:
Spoiler
Code:
# To download R, go to the following link
# and choose a mirror server closest to your location:
# https://cran.r-project.org/mirrors.html
# Then download the appropriate version for your operating system
#
# if you want to learn more about the packages available at CRAN,
# start with task views:
# https://cran.r-project.org/web/views/
#
# if you want to read about a package's functionality, try for example:
# vignette(package = "rvest")
#
# browse the list of topics and select one
# vignette(package = "rvest", topic = "selectorgadget")
#
# to run R, download RStudio for your operating system at the following link:
# https://www.rstudio.com/products/rstudio/download/

About memory:
Spoiler
Code:
# R loads everything into virtual memory.
# Use a system program to monitor your system memory usage.
# Use the following functions to monitor your R memory (i.e. lsos()).
# Found at: http://stackoverflow.com/questions/1358003/tricks-to-manage-the-available-memory-in-an-r-session
#
# improved list of objects
.ls.objects <- function(pos = 1, pattern, order.by, decreasing = FALSE,
                        head = FALSE, n = 5) {
  napply <- function(names, fn) sapply(names, function(x) fn(get(x, pos = pos)))
  names <- ls(pos = pos, pattern = pattern)
  obj.class <- napply(names, function(x) as.character(class(x))[1])
  obj.mode <- napply(names, mode)
  obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
  obj.size <- napply(names, object.size)
  obj.dim <- t(napply(names, function(x) as.numeric(dim(x))[1:2]))
  vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
  obj.dim[vec, 1] <- napply(names, length)[vec]
  out <- data.frame(obj.type, obj.size, obj.dim)
  names(out) <- c("Type", "Size", "Rows", "Columns")
  if (!missing(order.by))
    out <- out[order(out[[order.by]], decreasing = decreasing), ]
  if (head)
    out <- head(out, n)
  out
}
# shorthand
lsos <- function(..., n = 10) {
  .ls.objects(..., order.by = "Size", decreasing = TRUE, head = TRUE, n = n)
}

The code I will share for now will download and read the CSVs into the program. It will keep only a dozen or so of the columns. The total amount of memory used when I ran this script was about 1.9 gigabytes.

Spoiler
Code:
# Load Dependencies -------------------------------------------------------
install.packages(c("rvest", "plyr", "dplyr")) # install first
library(rvest)
library(plyr)  ## for optional ldply()
library(dplyr)

# Create, read, and parse url for the unique hrefs ------------------------
url <- "http://www.kauffman.org/microsites/kauffman-index/about/archive/kauffman-index-of-entrepreneurial-activity-data-files"
hrefs <- url %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href")
hrefs.csv <- hrefs[grep("csv", hrefs)] %>% unique()

# Create file names to save downloads -------------------------------------
d.file <- paste0("k.", 2014:1996, ".csv")

# Download the csv files in hrefs.csv -------------------------------------
# these files will be given the names in d.file
# files will download to your working directory
# getwd() is your working directory; setwd() sets your working directory
d.status <- mapply(FUN = download.file, url = hrefs.csv, destfile = d.file)
# sum(d.status) should equal 0

# Read all of the downloaded files ----------------------------------------
# if you want to UNION all the data frames in the list,
# delete the "#" in "# %>% ldply()" and ignore names(k.data)
k.data <- lapply(d.file, function(x) {
  # the quick and dirty approach to finding and declaring column data types
  classes <- readLines(x, n = 2L) %>%
    textConnection() %>%
    read.csv(stringsAsFactors = FALSE) %>%
    sapply(class)
  classes[classes == "logical"] <- "numeric"
  classes[classes == "integer"] <- "numeric"
  x.file <- read.csv(x, colClasses = classes) %>%
    tbl_df() %>%
    select(month, year, immigr, age, ent015u, ent015ua, faminc, grdatn,
           hours, state, class, mlr, indmaj2, class_t1, mlr_t1,
           indmaj2_t1, pid)
}) # %>% ldply()
names(k.data) <- 2014:1996 # add names to the list, or IGNORE if using ldply()

Toodles.
What the f**k are you talking about? Yeah, you referred to H1B holders in a totally different post than the one that prompted your whole demand for data.
You went to this much effort to show you could read data and didn't even start wrangling it beyond selectively choosing a few columns and setting up the dataframe? You left so many random comments behind that I have to conclude this is from some tutorial. It would have taken you a few more steps just to filter through with dplyr. A few problems here:

1) Why did you choose these particular columns? I didn't see you choose natvty (and are you going to be doing your analysis year-by-year or month-by-month?). Curious as to your thinking behind why family income matters in this debate.

2) You remind me of why I hate R syntax and generally try to avoid it, but even a cursory reading of this (and maybe why the file size is so large) shows me that you're ingesting a lot of CSVs for no reason! If you parse all of the CSV links, you'll also end up with:
National Components Data 2015
National Demographic Components Data 2015
State Components Data 2015
Metro Area Components Data 2015
All Geographies Components Data 2015
And a blowjob of a mess when it comes to wrangling. Does rvest not have filtering/parsing options like requests and beautifulsoup? o_0

3) And I guess the major problem, for somebody who likes aggregating and faceting data: why did you stop at importing them into memory, copy+pasting two blocks of core R documentation (including, of all things, installation instructions?!), and then calling it a day? You'd honestly be closer to actionable insights with a few more lines of code than the time it took you to copy + paste random documentation. Unless your tbl_df is all f**ked up, which, judging by how it's been imported, I would guess is the case. I mean, I don't even mess with R, but damn dude, this is pretty lazy.
I don't know how R deals with time series analysis, I imagine there must be some really good libraries out there somewhere, but Pandas is ace for that s**t. In case you wanted to facet by time periods (you do.) http://pandas.pydata.org/pandas-docs/stable/timeseries.html
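The time-period faceting I mean looks roughly like this in pandas. A minimal sketch with a made-up monthly series standing in for ent015u, not the Kauffman files themselves:

```python
import pandas as pd

# Made-up monthly entrepreneurship indicator, standing in for ent015u.
idx = pd.date_range("2013-01-01", periods=24, freq="MS")  # month starts
ts = pd.Series(range(24), index=idx, name="ent015u")

# Facet by year: group the monthly values by calendar year.
annual = ts.groupby(ts.index.year).mean()  # 2013 -> 5.5, 2014 -> 17.5
print(annual)

# Or facet by arbitrary periods via a PeriodIndex:
by_quarter = ts.groupby(ts.index.to_period("Q")).sum()
```

With a DatetimeIndex in place, `resample` gives you the same annual/quarterly rollups plus offsets and interpolation, which is the part of that timeseries doc page worth reading first.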
I've brought up H1B visas in this thread (see above quote), and even corrected you on Sergey Brin's immigration status. This is the code I am sharing for now - as I said above. I provided information to get you started in R. If you, or any other CF member, should need further assistance, please let me know.
As far as parsing options for rvest: it uses xml2 for parsing and sits on httr for requests. You could use httr to send individual requests yourself, but I think much of that is beyond the scope of this thread. rvest and download.file should be enough to download the data. Feel free to show whatever alternate code you may have.
I'm aware of that. I'm telling you your columns and your rows are probably screwed up as a function of how you designed your tbl_df and how you combined different CSVs. I am not questioning your use of tbl_df given you're dealing with a larger though not unpleasantly large data set, I am questioning your particular tbl_df and the logic of how you decided to wrangle data.
Or you can not use R, which, until Hadley Wickham came along, utterly sucked balls at taking information from the web. No, you don't have to tell me which libraries in R are using one another; that's not the problem I'm bringing up. Or you could even have manually stored things instead of scraping them together, if you couldn't do it properly in R. Jesus. One of these weekends, if I have the time, I'll do the whole ******* thing with requests + beautifulsoup and it'll take a hell of a lot less time and documentation to get through it all. Of course, this is your data that you asked for to "facet and aggregate". Is this what is taught in the States? Maybe that's why H1B visas are so in demand.
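The requests + beautifulsoup version I have in mind is just "fetch, find the anchor tags, keep only the hrefs you actually want". A portable sketch of the filtering step, using only the stdlib parser and an inline HTML snippet in place of the live archive page (the file names and the "kiea" pattern are hypothetical):

```python
from html.parser import HTMLParser

# With requests + BeautifulSoup this would be roughly:
#   soup = BeautifulSoup(requests.get(url).text, "html.parser")
#   links = [a["href"] for a in soup.find_all("a", href=True)
#            if a["href"].endswith(".csv") and "kiea" in a["href"]]
# The point is to filter while collecting, instead of grabbing every href.

class CsvLinkFilter(HTMLParser):
    """Collect hrefs that look like the index data files and nothing else."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if href.endswith(".csv") and "kiea" in href:
            self.links.append(href)

# Inline stand-in for the archive page (hypothetical file names):
page = """
<a href="/files/kiea_2014.csv">KIEA 2014</a>
<a href="/files/components_2015.csv">State Components Data 2015</a>
<a href="/about">About</a>
"""

parser = CsvLinkFilter()
parser.feed(page)
print(parser.links)  # only the KIEA file survives the filter
```

Filtering at collection time is exactly what avoids dragging in the Components files complained about above; the same predicate works whether the hrefs come from BeautifulSoup, rvest's html_attr output, or this stdlib parser.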