Immigration and Jobs

Discussion in 'BBS Hangout: Debate & Discussion' started by Northside Storm, Mar 7, 2016.

  1. Cohete Rojo

    For someone so proficient with Python, why did you recommend a CSV and not Feather?
     
  2. Northside Storm

    It's the lowest common denominator format. I don't have to assume any technical knowledge on your part: anybody who has ever touched a default spreadsheet tool can figure out a CSV.

    I prefer HDF5 myself for larger dataframes, but I'm used to interfacing with all kinds of people, especially open data folks, and I've learned one thing pretty quickly: don't assume people know anything about technology. Always lean toward the most accessible format.

    I expect I'll have to move on from HDF5 if I want to learn how to deal with more distributed-computing-type problems.
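
    For a concrete picture of that trade-off, here's a minimal pandas sketch (file names are made up, and to_hdf needs the PyTables package installed):

    Code:
    # CSV is universal but slow for larger frames; HDF5 is a faster binary
    # format. File names below are hypothetical.
    import pandas as pd

    df = pd.read_csv("kisa_2014.csv")                  # anyone can open a CSV
    df.to_hdf("kisa_2014.h5", key="kisa")              # fast binary round-trip
    df_back = pd.read_hdf("kisa_2014.h5", key="kisa")  # same dataframe back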

    thanks bruv, let me know if you still want that CSV
     
    #102 Northside Storm, May 18, 2016
    Last edited: May 18, 2016
  3. Cohete Rojo

    Once again, yes I would like to see that data. I'd also like to see that link to your lawyer's website you talked about in the other thread. For someone teaching the basics of "data" science, you seem to have an aversion to data itself.
     
  4. Northside Storm

    Email me through the board and let's set up a 10-minute phone call if you're so interested in knowing about me and data science.

    My aversion to redoing work other people have already done is kinda rule #1 of programming. :rolleyes:

    but here...

    http://www.kauffman.org/microsites/kauffman-index/about/archive/kauffman-index-of-entrepreneurial-activity-data-files

    KISA Data year X to 2014 is what you want. I'm sure you know enough to get that into different file formats if needed :rolleyes:

    You will need the Codebook (in the link I attached) to understand the column labels. You are probably most interested in immigr (a simple binary categorical variable) and, for more granular data across the years, natvty, though natvty is a numeric column whose values correspond to country codes that you may want to store in a dictionary.

    You will probably want to stitch the CSVs together, convert them to a different file format for performance reasons (hey, Feather could work), and then use Pandas to slice and dice through the years to get the growth figures for native vs. non-native entrepreneurship. I suggest reading them into Pandas and wrangling them there before optimizing for performance; that should be fairly simple, since all of the datasets have the same number of features. You can mark which timeframe a value comes from by appending a dummy combined year-month column to each CSV, or just refer to the year column if you don't want month-by-month detail. To do your own in-depth analysis of the contested figure, graph the change in ent015u against the demographic factors. (Rough sketch below.)
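
    A rough sketch of that workflow (untested against the real files; file names are hypothetical, column names are from the Codebook):

    Code:
    # Stitch the yearly KISA CSVs together, then compare the average rate
    # of new entrepreneurs (ent015u) for immigrants vs. natives.
    # File names are hypothetical.
    import glob
    import pandas as pd

    frames = []
    for path in sorted(glob.glob("kisa_*.csv")):
        df = pd.read_csv(path, usecols=["year", "month", "immigr", "ent015u"])
        # dummy combined year-month column, as described above
        df["yearmonth"] = (df["year"].astype(str) + "-"
                           + df["month"].astype(str).str.zfill(2))
        frames.append(df)

    kisa = pd.concat(frames, ignore_index=True)
    print(kisa.groupby(["year", "immigr"])["ent015u"].mean())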

    You will not, I repeat NOT, want to use any spreadsheet tool, as each individual CSV is quite large.

    You may also want to check the CPS for cross-referencing, and to see what the person_ids correspond to. It may contain more qualitative data that the categories here don't capture.

    My immigration attorney: http://gordonlawgrouppc.com/team/gali-schaham-gordon/. Don't contact her unsolicited unless you're willing to sink in 1-1.5k.

    Thanks.
     
    #104 Northside Storm, May 19, 2016
    Last edited: May 19, 2016
  5. Cohete Rojo

    I don't use Pandoc, but I will take a look. Thanks.
     
  6. Northside Storm

    Whatever you use, be aware of performance and memory issues. The files aren't that large, but stitching everything together won't be totally trivial unless you do it programmatically.
     
  7. Northside Storm

    Cohete!

    Apparently I have to repeat myself:

    Straight from the horse's mouth and the Codebook itself:

    immigr: immigrant
    natvty: country of birth (see Appendix 4 for codes)

    You even have, for s**ts and giggles:

    spneth: Spanish ethnicity
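
    And since natvty is just integer codes, a plain dictionary lookup covers it (the codes below are invented; the real ones are in Appendix 4 of the Codebook):

    Code:
    # Hypothetical natvty lookup; real codes are in Appendix 4.
    import pandas as pd

    natvty_codes = {57: "United States", 231: "India", 312: "El Salvador"}
    df = pd.DataFrame({"natvty": [57, 231, 57]})
    df["birth_country"] = df["natvty"].map(natvty_codes)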

    Here is our original discussion:

    I've provided you data showing that immigrants found businesses at twice the rate of natives. That discussion is independent of immigration status. You apparently like to facet your data; well, don't let moving the goalposts do the work for you.

    Re: wages

    http://www.cato.org/blog/immigrations-real-impact-wages-employment

    David Card (2012), "Comment: The Elusive Search for Negative Wage Impacts of Immigration": http://davidcard.berkeley.edu/papers/jeea2012.pdf

     
    #107 Northside Storm, May 28, 2016
    Last edited: May 28, 2016
  8. Cohete Rojo

    Nope. That tells me absolutely nothing about whether someone is an H1B visa holder, a student, illegal, or something else.
     
  9. Northside Storm

    Look at my edited post. You've moved the goalposts, and you know it.

    I've provided you data showing that immigrants found businesses at twice the rate of natives. That discussion is independent of immigration status. You apparently like to facet your data; well, don't let moving the goalposts do the work for you.

    It's okay just to say "I'm lazy, and changing the goalposts is what I'm currently doing."

    I'd like to see your "incomplete code". Please share it in a GitHub repo, or format it here if you have privacy issues.
     
    #109 Northside Storm, May 28, 2016
    Last edited: May 28, 2016
  10. Northside Storm

    As for changing the goalposts: as it turns out, you don't even like thinking about your data too deeply. I guess the faceting perspective only comes out for things that are spoonfed to you.

    http://fivethirtyeight.com/datalab/how-do-we-know-how-many-undocumented-immigrants-there-are/

    http://www.pewhispanic.org/2008/10/02/appendix-a-methodology-2/

    For 1820-2012, you have data on legal permanent residents from the DHS. From that you can extrapolate a reasonable confidence interval for the number of illegal immigrants, if that's your thing.

    https://www.dhs.gov/publication/yearbook-immigration-statistics-2012-legal-permanent-residents

    Have fun!
     
  11. g1184

  12. Cohete Rojo

    I haven't moved any goalposts.

    Here is the primer on the code and such:

    Getting started:
    Code:
    # To download R, go to the following link
    # and choose a mirror server closest to your location:
    #     https://cran.r-project.org/mirrors.html
    # Then download the appropriate version for your operating system
    # 
    #         if you want to learn more about the packages available at CRAN, 
    #         start with task views:
    #             https://cran.r-project.org/web/views/
    #
    #         if you want to read about a package's functionality, try for example:
    #             vignette(package = "rvest")
    #             # browse the list of topics and select one
    #             vignette(package = "rvest", topic = "selectorgadget")
    # 
    # to run R, download RStudio for your operating system at the following link:
    #     https://www.rstudio.com/products/rstudio/download/
    

    About memory:
    Code:
    # R loads everything into virtual memory.
    # Use a system program to monitor your system memory usage.
    # Use the following functions to monitor your R memory (e.g. lsos()).
    # Found at: http://stackoverflow.com/questions/1358003/tricks-to-manage-the-available-memory-in-an-r-session
    #
    # improved list of objects
    .ls.objects <- function (pos = 1, pattern, order.by,
                             decreasing=FALSE, head=FALSE, n=5) {
        napply <- function(names, fn) sapply(names, function(x)
            fn(get(x, pos = pos)))
        names <- ls(pos = pos, pattern = pattern)
        obj.class <- napply(names, function(x) as.character(class(x))[1])
        obj.mode <- napply(names, mode)
        obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
        obj.size <- napply(names, object.size)
        obj.dim <- t(napply(names, function(x)
            as.numeric(dim(x))[1:2]))
        vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
        obj.dim[vec, 1] <- napply(names, length)[vec]
        out <- data.frame(obj.type, obj.size, obj.dim)
        names(out) <- c("Type", "Size", "Rows", "Columns")
        if (!missing(order.by))
            out <- out[order(out[[order.by]], decreasing=decreasing), ]
        if (head)
            out <- head(out, n)
        out
    }
    # shorthand
    lsos <- function(..., n=10) {
        .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
    }
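    # usage: lsos() lists the ten largest objects in your session by size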
    

    The code I will share for now downloads the CSVs and reads them into the program.
    It keeps only a dozen or so of the columns.
    The total amount of memory used when I ran this script was about 1.9 gigabytes.
    Code:
    
    # Load Dependencies -------------------------------------------------------
    
    install.packages(c("rvest", "plyr", "dplyr"))  # install first
    library(rvest)
    library(plyr)   ## for optional ldply()
    library(dplyr)
    
    
    # Create, read, and parse url for the unique hrefs ------------------------
    
    
    url <- "http://www.kauffman.org/microsites/kauffman-index/about/archive/kauffman-index-of-entrepreneurial-activity-data-files"
    hrefs <- url %>%
             read_html() %>%
             html_nodes("a") %>%
             html_attr("href")
    hrefs.csv <- hrefs[grep("csv", hrefs)] %>% unique
    
    
    # Create file names to save downloads -------------------------------------
    
    
    d.file <- paste0("k.",
                     2014:1996,
                     ".csv")
    
    
    # Download the csv files in hrefs.csv -------------------------------------
    
    
    # these files will be given the names in d.file
    # files will download to your working directory
    # getwd() is your working directory; setwd() sets your working directory
    d.status <- mapply(FUN = download.file, url = hrefs.csv, destfile = d.file)
    # sum(d.status) should equal 0
    
    
    # Read all of the downloaded files ----------------------------------------
    
    
    # if you want to UNION all the data frames in the list:
        # then delete the "#" in "# %>% ldply()"
        # then ignore names(k.data)
    k.data <- lapply(d.file, function(x){
        # the quick and dirty approach to finding and declaring column data types
        classes <- readLines(x, n = 2L) %>% textConnection() %>%
                   read.csv(stringsAsFactors = FALSE) %>%
                   sapply(class)
        classes[classes == "logical"] <- "numeric" 
        classes[classes == "integer"] <- "numeric"
        x.file <- read.csv(x, colClasses = classes) %>% tbl_df() %>%
                  select(month,
                         year,
                         immigr,
                         age,
                         ent015u,
                         ent015ua,
                         faminc,
                         grdatn,
                         hours,
                         state,
                         class,
                         mlr,
                         indmaj2,
                         class_t1,
                         mlr_t1,
                         indmaj2_t1,
                         pid)
    }) # %>% ldply()
    
    names(k.data) <- 2014:1996  # add names to the list or IGNORE if using ldply()
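    # optional sanity check: sapply(k.data, dim) shows the row/column
    # counts for each year's dataframe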
    

    Toodles.
     
  13. Northside Storm

    What the f**k are you talking about? Yeah, you referred to H1B holders in a totally different post than the one that prompted your whole demand for data.
     
  14. Northside Storm

    You went to this much effort to show you could read data, and you didn't even start wrangling it beyond selectively choosing a few columns and setting up the dataframe? You left so many random comments behind that I have to conclude this came from some tutorial. It would have taken only a few more steps to filter through it with dplyr.

    A few problems here:

    1) Why did you choose these particular columns? I didn't see you choose natvty or, crucially, month (are you going to be doing your analysis year-by-year?). And I'm curious about your thinking: why does family income matter in this debate?

    2) You remind me of why I hate R syntax and generally try to avoid it, but even a cursory reading of this (and maybe why the memory footprint is so large) shows me that you're ingesting a lot of CSVs for no reason! If you parse all the CSV links, you'll also end up with

    National Components Data 2015
    National Demographic Components Data 2015
    State Components Data 2015
    Metro Area Components Data 2015
    All Geographies Components Data 2015

    And a blowjob of a mess when it comes to wrangling :confused:

    Does rvest not have filtering/parsing options like requests and beautifulsoup? o_0

    3) And I guess the major problem: for somebody who likes aggregating and faceting data, why did you stop at importing the files into memory, copy-pasting two blocks of core R documentation (including, of all things, installation instructions?!), and calling it a day? You honestly would have been closer to actionable insights with a few more lines of code than with the time it took you to copy-paste random documentation.

    Unless your tbl_df is all f**ked up, which, judging by how it's been imported, I would guess is the case.

    I mean, I don't even mess with R, but damn dude, this is pretty lazy.
     
    #114 Northside Storm, May 31, 2016
    Last edited: May 31, 2016
  15. Northside Storm

  16. Cohete Rojo

    I've brought up H1B visas in this thread (see above quote), and even corrected you on Sergey Brin's immigration status.

    This is the code I am sharing for now - as I said above.

    I provided information to get you started in R. If you, or any other CF member, should need further assistance, please let me know.
     
  17. Cohete Rojo

    Btw, that's the way tbl_df is designed to print. It shows only the first ten rows and as many columns as fit on screen, not the whole table.
     
  18. Cohete Rojo

    As far as parsing options for rvest go, it uses xml2 for parsing and httr underneath for requests. You could use httr directly to send individual requests, but I think much of that is beyond the scope of this thread. rvest and download.file should be enough to download the data.

    Feel free to show whatever alternate code you may have.
     
  19. Northside Storm

    I'm aware of that. I'm telling you that your columns and rows are probably screwed up as a function of how you designed your tbl_df and how you combined the different CSVs. I am not questioning your use of tbl_df, given you're dealing with a larger though not unmanageably large dataset; I am questioning your particular tbl_df and the logic of how you decided to wrangle the data.
     
    #119 Northside Storm, May 31, 2016
    Last edited: May 31, 2016
  20. Northside Storm

    Or you can just not use R, which, until Hadley Wickham came along, utterly sucked balls at taking information from the web. No, you don't have to tell me which R libraries use one another; that's not the problem I'm bringing up.

    Or you could even have stored things manually instead of scraping them together, if you couldn't do it properly in R. Jesus.

    One of these weekends, if I have the time, I'll do the whole ******* thing with requests + beautifulsoup, and it'll take a hell of a lot less time and documentation to get through it all. Of course, this is the data you asked for to "facet and aggregate".
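
    Something along these lines (a sketch, not a finished job; the ".csv"/"KISA" filtering is an assumption about how the archive page's links are structured):

    Code:
    # requests + beautifulsoup version: grab only the KISA CSV links from
    # the archive page and download them. The "KISA" substring filter is an
    # assumption about the page's hrefs; adjust to match the real links.
    import requests
    from bs4 import BeautifulSoup

    URL = ("http://www.kauffman.org/microsites/kauffman-index/about/archive/"
           "kauffman-index-of-entrepreneurial-activity-data-files")

    soup = BeautifulSoup(requests.get(URL).text, "html.parser")
    links = {a["href"] for a in soup.find_all("a", href=True)
             if a["href"].endswith(".csv") and "KISA" in a["href"]}

    for link in sorted(links):
        with open(link.rsplit("/", 1)[-1], "wb") as f:
            f.write(requests.get(link).content)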

    Is this what is taught in the States? Maybe that's why H1B visas are so in demand.
     
    #120 Northside Storm, May 31, 2016
    Last edited: May 31, 2016
